751 research outputs found

Parallelising wavefront applications on general-purpose GPU devices

Author: Hammond Simon D.
Jarvis Stephen A.
Mudalige Gihan R.
Pennycook Simon J.
Publication venue: Performance Computing and Visualisation, Department of Computer Science, University of Warwick
Publication date: 01/07/2010

Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups

Warwick Research Archives Portal Repository

Experiences with porting and modelling wavefront algorithms on many-core architectures

Author: Hammond Simon D.
Jarvis Stephen A.
Mudalige Gihan R.
Pennycook Simon J.
Publication date: 01/09/2010

We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications and this report focuses on graphics processing units (GPUs) in particular. To this end, we have implemented NASA’s LU benchmark – a real world production-grade application – on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA). This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC “Tesla” and “Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high performance computing (HPC) clusters based on a range of multi-core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC). In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI). By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale

Warwick Research Archives Portal Repository

WMTrace : a lightweight memory allocation tracker and analysis framework

Author: Hammond Simon D.
Jarvis Stephen A.
Pennycook Simon J.
Perks O. F. J.
Publication date: 01/07/2011

The diverging gap between processor and memory performance has been a well-discussed aspect of computer architecture literature for some years. The use of multi-core processor designs has, however, brought new problems to the design of memory architectures - increased core density without matched improvement in memory capacity is reducing the available memory per parallel process. Multiple cores accessing memory simultaneously degrades performance as a result of resource contention for memory channels and physical DIMMs. These issues combine to ensure that memory remains an on-going challenge in the design of parallel algorithms which scale. In this paper we present WMTrace, a lightweight tool to trace and analyse memory allocation events in parallel applications. This tool is able to dynamically link to pre-existing application binaries requiring no source code modification or recompilation. A post-execution analysis stage enables in-depth analysis of traces to be performed allowing memory allocations to be analysed by time, size or function. The second half of this paper features a case study in which we apply WMTrace to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water mark memory consumption as well as memory use per-function over time. An in-depth analysis is provided for an unstructured mesh benchmark which reveals significant memory allocation imbalance across its participating processes

Warwick Research Archives Portal Repository

On the acceleration of wavefront applications using distributed many-core architectures

Author: Hammond Simon D.
Jarvis Stephen A.
Mudalige Gihan R.
Pennycook Simon J.
Wright Steven A.
Publication venue: Oxford University Press (OUP)
Publication date: 01/02/2012

In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures

CiteSeerX

University of Birmingham Research Portal

Warwick Research Archives Portal Repository

White Rose Research Online

An investigation of the performance portability of OpenCL

Author: Hammond Simon D.
Herdman J. A.
Jarvis Stephen A.
Miller I.
Pennycook Simon J.
Wright Steven A.
Publication venue: Elsevier BV
Publication date: 11/08/2012

This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation

Warwick Research Archives Portal Repository

Predictive analysis of a hydrodynamics application on large-scale CMP clusters

Author: Davis J. A.
Hammond Simon D.
Herdman J. A.
Jarvis Stephen A.
Miller I.
Mudalige Gihan R.
Publication venue: Springer

We present the development of a predictive performance model for the high-performance computing code Hydra, a hydrodynamics benchmark developed and maintained by the United Kingdom Atomic Weapons Establishment (AWE). The developed model elucidates the parallel computation of Hydra, with which it is possible to predict its runtime and scaling performance on varying large-scale chip multiprocessor (CMP) clusters. A key feature of the model is its granularity; with the model we are able to separate the contributing costs, including computation, point-to-point communications, collectives, message buffering and message synchronisation. The predictions are validated on two contrasting large-scale HPC systems, an AMD Opteron/InfiniBand cluster and an IBM BlueGene/P, both of which are located at the Lawrence Livermore National Laboratory (LLNL) in the US. We validate the model on up to 2,048 cores, where it achieves >85% accuracy in weak-scaling studies. We also demonstrate use of the model in exposing the increasing costs of collectives for this application, and also the influence of node density on network accesses, therefore highlighting the impact of machine choice when running this hydrodynamics application at scale

Warwick Research Archives Portal Repository

7th international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 2016)

Author: Hammond Simon D.
Jarvis Stephen A.
Wright Steven A.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 02/02/2017

University of Birmingham Research Portal

White Rose Research Online

Integrating design planning, schedule and control with Deplan

Author: Glenn Ballard
Hyun Jeong Choo
Iris D. Tommelein
Jamie W. Hammond
Simon Austin
Publication date: 01/01/2000

The planning and management of building design has historically been focused upon traditional methods of planning such as Critical Path Method (CPM). Little effort is made to understand the complexities of the design process; instead design managers focus on allocating work packages where the planned output is a set of deliverables. All too often there is no attempt to understand and control the flow of information that gives rise to these deliverables. This paper proposes the combined use of the Analytical Design Planning Technique (ADePT) and Last Planner methodology as a tool called DePlan to improve the planning, scheduling and control of design. ADePT is applied during the early planning stages to provide the design team with an improved design programme that takes into account the complex relationships that exist between designers, and the information that flows between them. Then the Last Planner methodology is employed, through a program called ProPlan, to schedule and control the design environment

Loughborough University Institutional Repository